STEP - 1 IMPORTING IMPORTANT LIBRARIES
# The os module provides functions for interacting with the operating system
import os
# The math module provides standard mathematical functions
import math
# The warnings module controls how warnings are handled within a Python script
import warnings
# NumPy performs a wide variety of mathematical operations on arrays
import numpy as np
# pandas provides functions for analyzing, cleaning, exploring, and manipulating data
import pandas as pd
# The visualization imports below are optional at this point but are used in the plotting steps
# Seaborn is a statistical data visualization library built on Matplotlib
import seaborn as sns
sns.set_style('darkgrid')
sns.set(color_codes=True)
# Matplotlib is the base plotting library
import matplotlib.pyplot as plt
from matplotlib import style
# Display Matplotlib figures inline in the Jupyter notebook
%matplotlib inline
# Suppress all warnings to keep the notebook output clean
warnings.filterwarnings('ignore')
STEP - 2 UNDERSTANDING THE DATASET
# Load the dataset used for the analysis
adi = pd.read_csv('Adidas US Sales Dataset.csv')
# shape returns the number of rows and columns of the dataset
rows, cols = adi.shape
print("Number of Rows = ", rows)
print("Number of Columns = ", cols)
Number of Rows = 9648
Number of Columns = 13
print(len(adi.columns))
13
# ndim returns the number of dimensions of the dataset
adi.ndim
2
adi  # the dataset used for the analysis
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Foot Locker | 1185732 | 01-01-2020 | Northeast | New York | New York | Men's Street Footwear | $50.00 | 1,200 | $6,00,000 | $3,00,000 | 50% | In-store |
| 1 | Foot Locker | 1185732 | 02-01-2020 | Northeast | New York | New York | Men's Athletic Footwear | $50.00 | 1,000 | $5,00,000 | $1,50,000 | 30% | In-store |
| 2 | Foot Locker | 1185732 | 03-01-2020 | Northeast | New York | New York | Women's Street Footwear | $40.00 | 1,000 | $4,00,000 | $1,40,000 | 35% | In-store |
| 3 | Foot Locker | 1185732 | 04-01-2020 | Northeast | New York | New York | Women's Athletic Footwear | $45.00 | 850 | $3,82,500 | $1,33,875 | 35% | In-store |
| 4 | Foot Locker | 1185732 | 05-01-2020 | Northeast | New York | New York | Men's Apparel | $60.00 | 900 | $5,40,000 | $1,62,000 | 30% | In-store |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9643 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Men's Apparel | $50.00 | 64 | $3,200 | $896 | 28% | Outlet |
| 9644 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Women's Apparel | $41.00 | 105 | $4,305 | $1,378 | 32% | Outlet |
| 9645 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Street Footwear | $41.00 | 184 | $7,544 | $2,791 | 37% | Outlet |
| 9646 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Athletic Footwear | $42.00 | 70 | $2,940 | $1,235 | 42% | Outlet |
| 9647 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Women's Street Footwear | $29.00 | 83 | $2,407 | $650 | 27% | Outlet |
9648 rows × 13 columns
adi.columns
Index(['Retailer', 'Retailer ID', 'Invoice Date', 'Region', 'State', 'City',
'Product', 'Price per Unit', 'Units Sold', 'Total Sales',
'Operating Profit', 'Operating Margin', 'Sales Method'],
dtype='object')
adi.describe()
| | Retailer ID |
|---|---|
| count | 9.648000e+03 |
| mean | 1.173850e+06 |
| std | 2.636038e+04 |
| min | 1.128299e+06 |
| 25% | 1.185732e+06 |
| 50% | 1.185732e+06 |
| 75% | 1.185732e+06 |
| max | 1.197831e+06 |
adi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9648 entries, 0 to 9647
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Retailer          9648 non-null   object
 1   Retailer ID       9648 non-null   int64
 2   Invoice Date      9648 non-null   object
 3   Region            9648 non-null   object
 4   State             9648 non-null   object
 5   City              9648 non-null   object
 6   Product           9648 non-null   object
 7   Price per Unit    9648 non-null   object
 8   Units Sold        9648 non-null   object
 9   Total Sales       9648 non-null   object
 10  Operating Profit  9648 non-null   object
 11  Operating Margin  9648 non-null   object
 12  Sales Method      9648 non-null   object
dtypes: int64(1), object(12)
memory usage: 980.0+ KB
STEP - 3 REMOVING IMPURE DATA
Impure data here means missing values (NA, NaN, NULL), empty values, and stray whitespace in the dataset.
# Strip leading and trailing whitespace from every text column
for col in adi.select_dtypes(include=['object']):
    adi[col] = adi[col].str.strip()
# dropna removes the rows that contain missing values
adi.dropna(inplace=True)
# Dataset information after stripping whitespace and removing missing values
adi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9648 entries, 0 to 9647
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Retailer          9648 non-null   object
 1   Retailer ID       9648 non-null   int64
 2   Invoice Date      9648 non-null   object
 3   Region            9648 non-null   object
 4   State             9648 non-null   object
 5   City              9648 non-null   object
 6   Product           9648 non-null   object
 7   Price per Unit    9648 non-null   object
 8   Units Sold        9648 non-null   object
 9   Total Sales       9648 non-null   object
 10  Operating Profit  9648 non-null   object
 11  Operating Margin  9648 non-null   object
 12  Sales Method      9648 non-null   object
dtypes: int64(1), object(12)
memory usage: 980.0+ KB
The output above shows that the dataset has no null values: all 13 columns still report 9648 non-null entries.
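Note that `dropna` only removes values pandas already treats as missing; a cell that holds an empty string after stripping still counts as non-null. The following is a minimal sketch of one way to handle that case, using a small made-up frame (`sample` and its values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a None, an empty string, and a whitespace-only string.
sample = pd.DataFrame({
    "Retailer": ["Foot Locker", "  ", "Walmart", None],
    "Units Sold": ["1,200", "850", "", "64"],
})

# Strip whitespace from every (text) column; missing values stay missing.
sample = sample.apply(lambda s: s.str.strip())

# Treat empty strings as missing so that dropna() removes those rows too.
cleaned = sample.replace("", np.nan).dropna()
print(cleaned)
```

Only the first row survives here, because each of the other rows has a missing or empty cell.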
adi.head(5)
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Foot Locker | 1185732 | 01-01-2020 | Northeast | New York | New York | Men's Street Footwear | $50.00 | 1,200 | $6,00,000 | $3,00,000 | 50% | In-store |
| 1 | Foot Locker | 1185732 | 02-01-2020 | Northeast | New York | New York | Men's Athletic Footwear | $50.00 | 1,000 | $5,00,000 | $1,50,000 | 30% | In-store |
| 2 | Foot Locker | 1185732 | 03-01-2020 | Northeast | New York | New York | Women's Street Footwear | $40.00 | 1,000 | $4,00,000 | $1,40,000 | 35% | In-store |
| 3 | Foot Locker | 1185732 | 04-01-2020 | Northeast | New York | New York | Women's Athletic Footwear | $45.00 | 850 | $3,82,500 | $1,33,875 | 35% | In-store |
| 4 | Foot Locker | 1185732 | 05-01-2020 | Northeast | New York | New York | Men's Apparel | $60.00 | 900 | $5,40,000 | $1,62,000 | 30% | In-store |
adi.tail(5)
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9643 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Men's Apparel | $50.00 | 64 | $3,200 | $896 | 28% | Outlet |
| 9644 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Women's Apparel | $41.00 | 105 | $4,305 | $1,378 | 32% | Outlet |
| 9645 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Street Footwear | $41.00 | 184 | $7,544 | $2,791 | 37% | Outlet |
| 9646 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Athletic Footwear | $42.00 | 70 | $2,940 | $1,235 | 42% | Outlet |
| 9647 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Women's Street Footwear | $29.00 | 83 | $2,407 | $650 | 27% | Outlet |
adi.size
125424
# nunique returns the number of unique values in each column
adi.nunique(axis=0)
Retailer               6
Retailer ID            4
Invoice Date         724
Region                 5
State                 50
City                  52
Product                6
Price per Unit        94
Units Sold           361
Total Sales         3138
Operating Profit    4187
Operating Margin      66
Sales Method           3
dtype: int64
STEP - 4 FILTERING THE DATASET
# drop_duplicates removes duplicate rows
adi = adi.drop_duplicates()
adi
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Foot Locker | 1185732 | 01-01-2020 | Northeast | New York | New York | Men's Street Footwear | $50.00 | 1,200 | $6,00,000 | $3,00,000 | 50% | In-store |
| 1 | Foot Locker | 1185732 | 02-01-2020 | Northeast | New York | New York | Men's Athletic Footwear | $50.00 | 1,000 | $5,00,000 | $1,50,000 | 30% | In-store |
| 2 | Foot Locker | 1185732 | 03-01-2020 | Northeast | New York | New York | Women's Street Footwear | $40.00 | 1,000 | $4,00,000 | $1,40,000 | 35% | In-store |
| 3 | Foot Locker | 1185732 | 04-01-2020 | Northeast | New York | New York | Women's Athletic Footwear | $45.00 | 850 | $3,82,500 | $1,33,875 | 35% | In-store |
| 4 | Foot Locker | 1185732 | 05-01-2020 | Northeast | New York | New York | Men's Apparel | $60.00 | 900 | $5,40,000 | $1,62,000 | 30% | In-store |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9643 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Men's Apparel | $50.00 | 64 | $3,200 | $896 | 28% | Outlet |
| 9644 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Women's Apparel | $41.00 | 105 | $4,305 | $1,378 | 32% | Outlet |
| 9645 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Street Footwear | $41.00 | 184 | $7,544 | $2,791 | 37% | Outlet |
| 9646 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Athletic Footwear | $42.00 | 70 | $2,940 | $1,235 | 42% | Outlet |
| 9647 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Women's Street Footwear | $29.00 | 83 | $2,407 | $650 | 27% | Outlet |
9648 rows × 13 columns
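The row count is still 9648 after `drop_duplicates`, which suggests the dataset contained no fully duplicated rows. A short sketch of how the number of duplicates could be checked explicitly before dropping them, shown on a made-up frame (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical sample in which the second row repeats the first.
sample = pd.DataFrame({
    "Retailer": ["Foot Locker", "Foot Locker", "Walmart"],
    "Units Sold": [100, 100, 250],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Drop the repeats and renumber the index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```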
# isnull flags each cell that holds a missing or null value
adi.isnull()
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9643 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9644 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9645 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9646 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9647 | False | False | False | False | False | False | False | False | False | False | False | False | False |
9648 rows × 13 columns
# Drop rows that contain missing values
adi = adi.dropna()
adi
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Operating Margin | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Foot Locker | 1185732 | 01-01-2020 | Northeast | New York | New York | Men's Street Footwear | $50.00 | 1,200 | $6,00,000 | $3,00,000 | 50% | In-store |
| 1 | Foot Locker | 1185732 | 02-01-2020 | Northeast | New York | New York | Men's Athletic Footwear | $50.00 | 1,000 | $5,00,000 | $1,50,000 | 30% | In-store |
| 2 | Foot Locker | 1185732 | 03-01-2020 | Northeast | New York | New York | Women's Street Footwear | $40.00 | 1,000 | $4,00,000 | $1,40,000 | 35% | In-store |
| 3 | Foot Locker | 1185732 | 04-01-2020 | Northeast | New York | New York | Women's Athletic Footwear | $45.00 | 850 | $3,82,500 | $1,33,875 | 35% | In-store |
| 4 | Foot Locker | 1185732 | 05-01-2020 | Northeast | New York | New York | Men's Apparel | $60.00 | 900 | $5,40,000 | $1,62,000 | 30% | In-store |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9643 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Men's Apparel | $50.00 | 64 | $3,200 | $896 | 28% | Outlet |
| 9644 | Foot Locker | 1185732 | 24-01-2021 | Northeast | New Hampshire | Manchester | Women's Apparel | $41.00 | 105 | $4,305 | $1,378 | 32% | Outlet |
| 9645 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Street Footwear | $41.00 | 184 | $7,544 | $2,791 | 37% | Outlet |
| 9646 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Men's Athletic Footwear | $42.00 | 70 | $2,940 | $1,235 | 42% | Outlet |
| 9647 | Foot Locker | 1185732 | 22-02-2021 | Northeast | New Hampshire | Manchester | Women's Street Footwear | $29.00 | 83 | $2,407 | $650 | 27% | Outlet |
9648 rows × 13 columns
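# Remove the thousands separators from Units Sold and convert it from string to float
adi['Units Sold'] = adi['Units Sold'].str.replace(',', '', regex=False).astype(float)
# Remove the $ symbol and separators from Total Sales and convert it to float
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
adi['Total Sales'] = adi['Total Sales'].str.replace('$', '', regex=False)
adi['Total Sales'] = adi['Total Sales'].str.replace(',', '', regex=False).astype(float)
# Operating Profit is used as a numeric axis in Step 6, so convert it the same way
adi['Operating Profit'] = adi['Operating Profit'].str.replace('$', '', regex=False)
adi['Operating Profit'] = adi['Operating Profit'].str.replace(',', '', regex=False).astype(float)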
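The same cleaning pattern applies to the other text-typed numeric columns such as Price per Unit and Operating Margin. As a sketch, the steps could be wrapped in a small helper; `to_number` is a hypothetical name, and the sample values are copied from the table shown earlier:

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Remove '$', ',' and '%' characters, then convert the result to float."""
    # Inside a character class, '$' is a literal character, not an anchor.
    cleaned = series.str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.DataFrame({
    "Total Sales": ["$6,00,000", "$3,200"],
    "Operating Margin": ["50%", "28%"],
})

sample["Total Sales"] = to_number(sample["Total Sales"])
# Store the margin as a fraction between 0 and 1.
sample["Operating Margin"] = to_number(sample["Operating Margin"]) / 100
print(sample)
```

With `errors="coerce"`, any value that still cannot be parsed becomes NaN instead of raising an error, so unexpected formats can be inspected afterwards.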
STEP - 5 PLOTTING GRAPHS AND VISUALS USING MATPLOTLIB AND SEABORN
COUNT OF DISTINCT STATES
# Count of distinct states
adi['State'].nunique()
50
LINE CHART
A line chart displays data points connected by straight line segments, illustrating trends or relationships over time or other ordered categories.
# Line chart of units sold by state (seaborn plots the mean per state with a confidence band)
sns.lineplot(x = 'State', y = 'Units Sold', data = adi)
plt.title("Units Sold by State")
plt.xlabel("State")
plt.xticks(rotation ='vertical')
plt.ylabel("Units")
plt.show()
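A line chart is most informative when the x-axis is ordered, such as time. The sketch below shows one way a monthly series could be prepared from Invoice Date. It assumes the dates are day-first (`%d-%m-%Y`), which values such as 24-01-2021 in the table suggest; the sample rows are taken from the table shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    "Invoice Date": ["01-01-2020", "02-01-2020", "24-01-2021", "22-02-2021"],
    "Units Sold": [1200.0, 1000.0, 64.0, 184.0],
})

# Assumption: the dates are day-month-year.
sample["Invoice Date"] = pd.to_datetime(sample["Invoice Date"], format="%d-%m-%Y")

# Total units sold per calendar month, in chronological order.
monthly = sample.groupby(sample["Invoice Date"].dt.to_period("M"))["Units Sold"].sum()
print(monthly)
```

The resulting series can then be drawn with `monthly.plot()` or passed to a line plot to show a trend over time.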
HISTOGRAM
A histogram is a graphical representation of the distribution of numerical data, displaying the frequency of values within specified intervals (bins).
# Histogram of total sales with 10 equal-width bins
plt.hist(adi['Total Sales'], bins=10)
plt.title('Histogram of Total Sales')
plt.xlabel('Total Sales')
plt.ylabel('Frequency')
plt.show()
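To see what the histogram computes, the bin edges and frequencies can be obtained directly with NumPy. This is a small sketch using the five Total Sales values from the last rows of the table; the choice of 3 bins is only for illustration.

```python
import numpy as np

# Total Sales values from the last five rows shown earlier.
values = np.array([3200, 4305, 7544, 2940, 2407], dtype=float)

# Split the range from the minimum to the maximum into 3 equal-width bins.
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.1f} to {right:8.1f}: {count}")
```

Each bar in the plotted histogram corresponds to one entry of `counts`, and `edges` always has one more element than `counts`.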
PIE CHART
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions.
# Pie chart of the 10 states with the most units sold
# Aggregate per state first: the leading rows of the dataset all belong to New York,
# so slicing with head(10) would not compare states
top_states = adi.groupby('State')['Units Sold'].sum().nlargest(10)
plt.pie(top_states, labels=top_states.index, autopct='%1.2f%%')
plt.title('Units Sold by State (Top 10)')
plt.show()
BAR GRAPH
A bar chart is a visual representation of data using rectangular bars, where the length of each bar corresponds to the value it represents.
# Bar plot of total sales summed per state
adi.groupby('State')['Total Sales'].sum().plot(kind='bar')
plt.title('Total Sales by State')
plt.xlabel('State')
plt.ylabel('Total Sales')
plt.show()
SCATTER PLOT
A scatter plot displays the relationship between two variables as dots on a Cartesian plane.
# Scatter plot of units sold against total sales
sns.scatterplot(x='Units Sold', y='Total Sales', data=adi)
plt.title('Units Sold vs Total Sales')
plt.xlabel('Units Sold')
plt.ylabel('Total Sales')
plt.show()
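The scatter plot gives a visual impression of the relationship; a correlation coefficient puts a number on it. A minimal sketch using the Pearson correlation (the pandas default), computed on the five rows from the end of the table:

```python
import pandas as pd

# Units Sold and Total Sales from the last five rows shown earlier.
sample = pd.DataFrame({
    "Units Sold": [64.0, 105.0, 184.0, 70.0, 83.0],
    "Total Sales": [3200.0, 4305.0, 7544.0, 2940.0, 2407.0],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# -1 a perfect decreasing one, and 0 means no linear relationship.
r = sample["Units Sold"].corr(sample["Total Sales"])
print(round(r, 3))
```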
BOX PLOT
A box plot graphically depicts groups of numerical data through their quartiles, highlighting the central tendency and variability of the distribution.
# Box plot of total sales per state
sns.boxplot(x='State', y='Total Sales', data=adi)
plt.title('Total Sales by State')
plt.xlabel('State')
plt.xticks(rotation ='vertical')
plt.ylabel('Total Sales')
plt.show()
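The elements of a box plot can be reproduced numerically. The sketch below computes the quartiles and the interquartile range (IQR), then flags points beyond 1.5 × IQR from the box, which is the default whisker rule in Matplotlib and seaborn. The values are the five Total Sales figures from the end of the table.

```python
import pandas as pd

sales = pd.Series([2407.0, 2940.0, 3200.0, 4305.0, 7544.0])

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = sales.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]
print(q1, median, q3, outliers.tolist())
```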
VIOLIN PLOT
A violin plot visualizes numeric data by combining a box plot with a rotated kernel density plot.
# Violin plot of total sales per state
sns.violinplot(x='State', y='Total Sales', data=adi)
plt.title('Total Sales by State')
plt.xlabel('State')
plt.xticks(rotation ='vertical')
plt.ylabel('Total Sales')
plt.show()
COUNT PLOT
A count plot is a type of bar plot that shows the number of observations in each category.
# Count plot of the number of records per state
sns.countplot(x='State', data=adi)
plt.title('Statewise Product Count')
plt.xlabel('State')
plt.xticks(rotation ='vertical')
plt.ylabel('Count')
plt.show()
SUBPLOTS
A subplot is a plotting area that allows for multiple plots to be arranged within the same figure, enabling comparison or combination of different visualizations.
# 2 x 2 grid of subplots; a larger figure gives the state labels more room
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
# First subplot
sns.lineplot(x='State', y='Units Sold', data=adi, ax=axes[0, 0])
axes[0, 0].set_title('Units Sold by State')
# Second subplot
sns.histplot(data=adi['Total Sales'], ax=axes[0, 1])
axes[0, 1].set_title('Histogram of Total Sales')
# Third subplot
sns.scatterplot(x='Units Sold', y='Total Sales', data=adi, ax=axes[1, 0])
axes[1, 0].set_title('Units Sold vs Total Sales')
# Fourth subplot
sns.boxplot(x='State', y='Total Sales', data=adi, ax=axes[1, 1])
axes[1, 1].set_title('Total Sales by State')
plt.tight_layout()
plt.show()
DISTRIBUTION PLOT
A distribution plot visualizes the distribution of a dataset, typically showing the frequency or probability of different values and often including a kernel density estimate to smooth the distribution curve.
# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(adi['Units Sold'], kde=True, stat='density')
plt.title('Distribution of Units Sold')
plt.xlabel('Units Sold')
plt.ylabel('Density')
plt.show()
PAIR PLOT
A pair plot is a matrix of scatter plots and histograms that shows the pairwise relationships and individual distributions of multiple variables in a dataset.
# Pair plot of the numeric columns
sns.pairplot(adi)
plt.show()
JOINT PLOT
A joint plot displays the relationship between two variables along with their individual distributions.
# Joint plot of units sold against total sales, with marginal distributions
g = sns.jointplot(x='Units Sold', y='Total Sales', data=adi)
# jointplot creates its own figure, so set the title on that figure instead of using plt.title
g.figure.suptitle('Units Sold vs Total Sales', y=1.02)
plt.show()
RUG PLOT
A rug plot displays individual data points as tick marks along a single axis.
# Rug plot of total sales
sns.rugplot(adi['Total Sales'])
plt.title('Rug Plot of Total Sales')
plt.xlabel('Total Sales')
plt.show()
STRIP PLOT
A strip plot is a scatter plot in which one variable is categorical and the other is continuous, with points aligned along the categorical axis.
# Strip plot of units sold per state
sns.stripplot(x='State', y='Units Sold', data=adi)
plt.title('Units Sold by State')
plt.xlabel('State')
plt.xticks(rotation='vertical')
plt.ylabel('Units Sold')
plt.show()
STEP - 6 PLOTTING GRAPHS AND VISUALS USING PLOTLY
*PLOTLY: Plotly is an open-source library for creating interactive visualizations in Python. Unlike static charts, its web-based plots let users zoom, pan, and hover over points for details, which makes it well suited to exploring, understanding, and presenting data.*
3D SCATTER PLOT
# 3D scatter plot of the dataset using Plotly Express
import plotly.express as px
fig = px.scatter_3d(adi, x='Total Sales', y='Operating Profit', z='Units Sold', color='Region')
fig.show()
SCATTER PLOT
# SCATTER PLOT WITH PLOTLY
fig = px.scatter(adi, x="Total Sales", y="Operating Profit", color="Region", title="Scatter Plot of Total Sales vs Operating Profit by Region")
fig.show()
BAR CHART
# BAR CHART using plotly
fig = px.bar(adi, x="State", y="Units Sold", color="Region", title="Bar Chart of Units Sold by State and Region")
fig.show()
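When `px.bar` receives the raw rows, it draws one stacked segment per row, which can make a chart with thousands of records slow and hard to read. One option is to aggregate first. The sketch below shows only the pandas step on a made-up frame; `sample` and its West/California row are hypothetical values for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the West/California row is invented for illustration.
sample = pd.DataFrame({
    "Region": ["Northeast", "Northeast", "Northeast", "West"],
    "State": ["New York", "New York", "New Hampshire", "California"],
    "Units Sold": [1200.0, 1000.0, 64.0, 500.0],
})

# One row per (Region, State) pair with the summed units.
by_state = sample.groupby(["Region", "State"], as_index=False)["Units Sold"].sum()
print(by_state)
```

The aggregated frame can then be passed to `px.bar` with the same `x`, `y`, and `color` arguments, so each state is drawn as a single bar.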
HISTOGRAM
# HISTOGRAM
fig = px.histogram(adi, x="Total Sales", nbins=20, title="Histogram of Total Sales")
fig.show()
PIE CHART
# PIE CHART
fig = px.pie(adi, values="Units Sold", names="Region", title="Pie Chart of Units Sold by Region")
fig.show()
BOX PLOT
# BOX PLOT
fig = px.box(adi, x="Region", y="Total Sales", title="Box Plot of Total Sales by Region")
fig.show()
VIOLIN PLOT
# VIOLIN PLOT
fig = px.violin(adi, x="Region", y="Total Sales", title="Violin Plot of Total Sales by Region")
fig.show()
3D SCATTER PLOT OF TOTAL SALES, OPERATING PROFIT AND UNITS SOLD BY REGION
# 3D SCATTER PLOT
fig = px.scatter_3d(adi, x="Total Sales", y="Operating Profit", z="Units Sold", color="Region", title="3D Scatter Plot of Total Sales, Operating Profit, and Units Sold by Region")
fig.show()